When analyzing a dataset, some data points can exert undue influence on models such as linear regression. These data points are called outliers. Outliers can give a misleading picture of the data and hurt model performance.
First, let's learn some basics about outliers.
In data science, outliers are values within a dataset that differ greatly from the others: they are either much larger or much smaller. Outliers can appear in a dataset due to measurement variability, data errors, experimental error, and so on. Outliers in the training data can cause machine learning models to make inaccurate predictions, so they should be handled before training a model.
One of the best ways to understand outliers is the box plot.
Box plots are very useful for seeing the distribution of a variable/feature and detecting outliers in it. A box plot is a graphical representation that describes the behavior of the data in the middle as well as at both ends of the distribution. It shows the data based on the five-number summary: the minimum, the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum.
The difference between the upper quartile and the lower quartile (Q3 - Q1) is called the interquartile range, or IQR.
Box plots help us find the outliers in the data by using the IQR. As a rule, values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are regarded as outliers. The image below will help us better understand the outliers in our data.
In the image above, the points that are outside the whisker lines are the outliers.
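As a rough illustration of the five-number summary and the 1.5*IQR rule described above, here is a minimal sketch using only Python's standard library. The sample values are made up for the example, and the quartiles use linear interpolation (`method="inclusive"`); other quantile conventions give slightly different fences.

```python
from statistics import quantiles

# Made-up sample prices for illustration; 2000 is a deliberately extreme value.
prices = [40, 55, 70, 80, 89, 90, 115, 149, 150, 225, 2000]

# Quartiles with linear interpolation between data points
# (method="inclusive"); other conventions shift the values slightly.
q1, median, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(prices), q1, median, q3, max(prices))

# Values outside the fences are flagged as outliers.
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print("five-number summary:", five_number_summary)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

In this sample only 2000 is flagged. The value 225 looks large, but it still lies inside the upper fence of 261.25.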
There are different techniques to handle outliers in a dataset. In our example, we will use the concept of clipping (winsorizing).
Clipping a dataset means capping the data at the last permitted extreme value, e.g. the 5th or 95th percentile value. For example, when we clip the data at the 95th percentile, every value above the 95th percentile is set to the 95th percentile value.
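The clipping described above can be sketched in a few lines of plain Python. The helper name `winsorize`, its parameters, and the sample data are hypothetical and chosen only for this illustration; the percentiles use linear interpolation.

```python
from statistics import quantiles


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles (whole numbers)."""
    # 99 cut points: cuts[k - 1] is the k-th percentile (linear interpolation).
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Values below `low` become `low`; values above `high` become `high`.
    return [min(max(v, low), high) for v in values]


# Made-up data: 19 ordinary values plus one extreme at each end.
data = [-50] + list(range(10, 29)) + [500]
clipped = winsorize(data)

print(clipped[0], clipped[-1])
```

The number of values stays the same; only the two extremes are pulled in to the 5th and 95th percentile values. With pandas, a similar result can be obtained for a Series `s` with `s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))`.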
The following data set has several (bolded) extremes:
After clipping/winsorizing the top and bottom 10% of the data (setting those values to the nearest permitted extreme), we get:
Let us solve a problem that replaces outliers from data using clipping.
We have a dataset named nyc_airbnb.csv, which contains data about the per-night price of Airbnb listings. In the dataset, we want to analyze the price feature. Before the analysis, we want to make sure there are no outliers in the price data. If there are any outliers, we want to replace them using the winsorizing/clipping method.
First, we load our dataset into a dataframe and view it.
Step 1: Import the pandas library as pd
import pandas as pd
Step 2: Load the data into a dataframe nyc using the read_csv method in pandas
nyc = pd.read_csv("../datasets/nyc_airbnb.csv")
Step 3: View the data stored in dataframe nyc
nyc
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 36484665 | Charming one bedroom - newly renovated rowhouse | 8232441 | Sabrina | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | NaN | NaN | 2 | 9 |
| 48891 | 36485057 | Affordable room in Bushwick/East Williamsburg | 6570630 | Marisol | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | NaN | NaN | 2 | 36 |
| 48892 | 36485431 | Sunny Studio at Historical Neighborhood | 23492952 | Ilgar & Aysel | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | NaN | NaN | 1 | 27 |
| 48893 | 36485609 | 43rd St. Time Square-cozy single bed | 30985759 | Taz | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | NaN | NaN | 6 | 2 |
| 48894 | 36487245 | Trendy duplex in the very heart of Hell's Kitchen | 68119814 | Christophe | Manhattan | Hell's Kitchen | 40.76404 | -73.98933 | Private room | 90 | 7 | 0 | NaN | NaN | 1 | 23 |
48895 rows × 16 columns
Since we want to find out whether outliers exist in the price data, an effective way of detecting them is visualization. A strip plot is a very useful graph for seeing how the data points of a feature/variable are spread, so we use one to check the price data for outliers.
Strip Plot Information
A strip plot is a single-axis scatter plot that is used to visualise the distribution of many individual one-dimensional values. The values are plotted as dots along a single axis, and dots with the same value can overlap. It is useful for observing variability, clustering, and outliers in small datasets.
For our strip plot, we visualize every data point of the price data and look at the spread of the prices along the y axis. For this plot, we import the plotly express library.
Step 1: First, import the plotly.express library as px
import plotly.express as px
Step 2: Call the strip() method and store it in variable price_strip
Explanation: Using px, call the strip() method with the following parameters:
nyc: the variable where the data is stored
price: the data column/feature to plot on the y axis
Explanation: Store the resulting plot in the variable price_strip.
Step 3: Display the variable price_strip using the show() method